Abstract. We show that for any two values $\alpha, \beta > 0$ for which $\alpha + \beta > 1$, there is a value
$N$ so that for all $n > N$ the following holds. For any binary phylogenetic tree $T$ on $n$ leaves
there is a set of $\lfloor n^\alpha \rfloor$ characters that capture $T$, and for which each character takes at most
$\lfloor n^\beta \rfloor$ distinct states. Here 'capture' means that $T$ is the unique perfect phylogeny for these
characters. Our short proof of this combinatorial result is based on the probabilistic method.


Given a function $f : X \to S$ let $\pi(f)$ denote the partition of $X$ induced by the equivalence
relation $x \sim x'$ if and only if $f(x) = f(x')$. If $|\pi(f)| \leq r$ we say that $f$ takes at most $r$ states (this
is equivalent to saying $|f(X)| \leq r$, and such characters are also referred to as 'r-state characters'
elsewhere). Given an (unrooted) phylogenetic $X$-tree $T$ (i.e. a tree with leaf set $X$ and no vertices of
degree 2), $f : X \to S$ is said to be a character on $X$, and $f$ is convex on $T$ if the minimal subtrees
of $T$ connecting the leaves of each block of the partition $\pi(f)$ are vertex disjoint. The condition of
$f$ being convex has a natural interpretation in biology of the character $f$ being 'homoplasy-free'
(for details, see [6]). Now suppose we are given a set $\mathcal{C}$ of characters on $X$. In this case $T$ is said
to be a perfect phylogeny for $\mathcal{C}$ if each of the characters in $\mathcal{C}$ is convex on $T$. Moreover, $\mathcal{C}$ is
said to capture $T$ if $T$ is the only perfect phylogeny for $\mathcal{C}$, in which case every non-leaf vertex of
$T$ must have degree 3.
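
The convexity condition above can be checked mechanically. The following is a minimal sketch, assuming the tree is given as an edge list and the character as a dictionary from leaves to states; the names `minimal_subtree`, `is_convex` and `QUARTET_EDGES` are illustrative and not taken from the paper. It finds each minimal subtree by repeatedly pruning leaves that lie outside the block, then tests vertex-disjointness.

```python
from collections import defaultdict


def minimal_subtree(adj, block):
    """Vertex set of the minimal subtree of the tree `adj` that connects `block`."""
    alive = set(adj)
    degree = {v: len(adj[v]) for v in adj}
    # Repeatedly prune vertices of degree <= 1 that are not in the block.
    stack = [v for v in adj if degree[v] <= 1 and v not in block]
    while stack:
        v = stack.pop()
        if v not in alive:
            continue
        alive.discard(v)
        for w in adj[v]:
            if w in alive:
                degree[w] -= 1
                if degree[w] <= 1 and w not in block:
                    stack.append(w)
    return alive


def is_convex(edges, character):
    """True if the minimal subtrees of the blocks of `character` are vertex disjoint."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    blocks = defaultdict(set)
    for leaf, state in character.items():
        blocks[state].add(leaf)
    used = set()
    for block in blocks.values():
        subtree = minimal_subtree(adj, block)
        if used & subtree:
            return False
        used |= subtree
    return True


# Quartet tree ab|cd with interior vertices u and v (illustrative example).
QUARTET_EDGES = [("a", "u"), ("b", "u"), ("u", "v"), ("c", "v"), ("d", "v")]
print(is_convex(QUARTET_EDGES, {"a": 0, "b": 0, "c": 1, "d": 1}))  # True
print(is_convex(QUARTET_EDGES, {"a": 0, "c": 0, "b": 1, "d": 1}))  # False
```

In the second call both minimal subtrees contain the interior vertices u and v, so the character is not convex on this tree.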

Suppose that $\mathcal{C}$ is a set of $k$ characters, each of which takes at most $r$ states. Then a fundamental
inequality states that $k$ must be at least $\lceil (n - 3)/(r - 1) \rceil$ (Proposition 4.2 of [6]). Remarkably,
this lower bound was recently shown [1] to be sharp for every fixed value of $r > 1$, provided that
$n > N_r$, where $N_r$ is some (increasing) function of $r$ (e.g. $N_2 = 3$, $N_3 = 13$ [1]). In other words,
for every $r > 1$, and every unrooted binary phylogenetic $X$-tree $T$, where $n = |X| > N_r$, there
is a set $\mathcal{C}$ of $\lceil (n - 3)/(r - 1) \rceil$ characters that captures $T$ and with each character in $\mathcal{C}$ taking at
most $r$ states.
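
As a small worked illustration of the lower bound, the sketch below evaluates $\lceil (n - 3)/(r - 1) \rceil$ with integer arithmetic. The function name `min_characters` is an illustrative choice, not notation from the paper.

```python
def min_characters(n: int, r: int) -> int:
    """Lower bound ceil((n - 3) / (r - 1)) on the number of r-state characters."""
    if n < 3 or r < 2:
        raise ValueError("expects n >= 3 leaves and r >= 2 states")
    # Ceiling division on integers avoids floating-point rounding.
    return -(-(n - 3) // (r - 1))


print(min_characters(13, 3))   # 5
print(min_characters(100, 2))  # 97
```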

In this note, we consider how small $k$ can be when $r$ is allowed to depend on $n$ (we write
$r = r_n$). From [?] it is known that there exists a set $\mathcal{C}$ of $k = 4$ characters for which the
associated number $r_n$ of states satisfies $n/r_n = O(1)$. Thus we focus on the setting where both
$r_n$ and $n/r_n$ grow with increasing $n$. More precisely, suppose that we want a set $\mathcal{C}_n$ consisting of
$k_n = \lfloor n^\alpha \rfloor$ characters on $X$, each taking at most $r_n = \lfloor n^\beta \rfloor$ states, to capture some phylogenetic
$X$-tree, where $\alpha, \beta > 0$. Notice that the inequality $k \geq \lceil (n - 3)/(r - 1) \rceil$ implies that $k_n$ must
exceed $n^{1-\beta}$ for $n$ sufficiently large, thus $\alpha + \beta > 1$. We show here that any value of $\alpha, \beta > 0$
with $\alpha + \beta > 1$ allows for such a set $\mathcal{C}_n$, and for any binary tree $T$.

The following result is independent of the main theorem of [1] mentioned
above, in the sense that neither result directly implies the other. Our short proof involves 
a simple application of the probabilistic method, the Chernoff bound, and a property of the 
random cluster model on trees established in [5]. 

Theorem 1. For any two values $\alpha, \beta > 0$ for which $\alpha + \beta > 1$ there is a value $N$ so that for all
$n > N$ the following holds. For any unrooted binary phylogenetic tree $T$ on a leaf set $X$ of size
$n$ there is a set $\mathcal{C}_n$ of $k_n = \lfloor n^\alpha \rfloor$ characters on $X$ that capture $T$, and for which each character
takes at most $r_n = \lfloor n^\beta \rfloor$ distinct states.
Proof. Consider the following random process performed on $T$. Each edge of $T$ is independently
cut with probability $p_n = r_n/(4n)$, or is left intact with probability $1 - p_n$. This leads to a partition
of $X$ corresponding to the equivalence relation $\sim$ under which two leaves are related if and only if they lie in
the same connected component of the resulting graph. We associate with each such partition
a character that induces this same partition (e.g. the character $f : X \to 2^X$ which maps $x$ to its
equivalence class under $\sim$). Notice that the number of 'states' of this associated character is simply
the number of blocks of the partition.
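
The random process just described can be simulated directly. The following is a sketch under stated assumptions: the tree is an edge list, a union-find structure merges the endpoints of every edge that is left intact, and the helper `caterpillar` (a binary caterpillar tree on $n$ leaves) is a hypothetical test tree chosen only for illustration. None of these names come from the paper.

```python
import random


def random_cut_character(edges, leaves, p, rng):
    """Cut each edge independently with probability p.

    Returns (character, cuts): a dict mapping each leaf to a label of its
    connected component, and the number of edges that were cut.
    """
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    cuts = 0
    for u, v in edges:
        if rng.random() < p:
            cuts += 1                  # edge is cut
        else:
            parent[find(u)] = find(v)  # edge is kept: merge components
    return {x: find(x) for x in leaves}, cuts


def caterpillar(n):
    """Hypothetical helper: edges and leaves of a binary caterpillar tree, n >= 3."""
    internal = [("v", j) for j in range(n - 2)]
    leaves = [("x", j) for j in range(n)]
    edges = [(internal[j], internal[j + 1]) for j in range(n - 3)]
    edges += [(leaves[j + 1], internal[j]) for j in range(n - 2)]
    edges += [(leaves[0], internal[0]), (leaves[n - 1], internal[-1])]
    return edges, leaves


rng = random.Random(0)
n, beta = 200, 0.5
r_n = int(n ** beta)
edges, leaves = caterpillar(n)  # 2n - 3 edges
character, cuts = random_cut_character(edges, leaves, r_n / (4 * n), rng)
print(len(set(character.values())), "states from", cuts, "cut edges")
```

Every character produced this way is convex on the tree, since its blocks are the leaf sets of disjoint connected components.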

Let $Y$ denote the random number of edges of $T$ that are cut. Then $Y$ has a binomial distribution
$Y \sim \mathrm{Bin}(2n - 3, p_n)$, which has mean $\mu_n = (2n - 3)p_n = (\tfrac{1}{2} - o(1))n^\beta$. By a multiplicative
form of the 'Chernoff bound' in probability theory (cf. [3], Eqn. (6) with $\epsilon = 1$) we have

$P(Y \geq 2\mu_n) \leq \exp(-\mu_n/3)$, and since $r_n > 2\mu_n$ we obtain:

(1) $P(Y \geq r_n) \leq \exp(-\mu_n/3)$.
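
Inequality (1) can be sanity-checked numerically for particular values. The sketch below computes the exact binomial upper tail in log-space (to avoid overflow for large $n$) and returns it alongside $\exp(-\mu_n/3)$. The function names and the sample values of $n$ and $\beta$ are illustrative assumptions; the check assumes $0 < p_n < 1$.

```python
import math


def binom_upper_tail(m, p, t):
    """Exact P(Y >= t) for Y ~ Bin(m, p), with 0 < p < 1, summed in log-space."""
    def log_pmf(j):
        return (math.lgamma(m + 1) - math.lgamma(j + 1) - math.lgamma(m - j + 1)
                + j * math.log(p) + (m - j) * math.log1p(-p))
    return sum(math.exp(log_pmf(j)) for j in range(t, m + 1))


def tail_and_bound(n, beta):
    """Return (P(Y >= r_n), exp(-mu_n / 3)) for the cutting process on n leaves."""
    r_n = math.floor(n ** beta)
    p_n = r_n / (4 * n)
    mu_n = (2 * n - 3) * p_n
    return binom_upper_tail(2 * n - 3, p_n, r_n), math.exp(-mu_n / 3)


for n in (50, 200, 1000):
    tail, bound = tail_and_bound(n, beta=0.5)
    print(n, tail, bound)
```

In such runs the exact tail is typically far below the Chernoff bound, which is all the proof requires.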

The number of blocks of the partition of $X$ induced by randomly cutting edges of $T$ in the
process described is at most $Y + 1$. Thus, the probability that a character, generated by the
random cluster model with $p_n$ value as specified, takes strictly more than $r_n$ states is at most
$P(Y + 1 > r_n) = P(Y \geq r_n) \leq \exp(-\mu_n/3)$, by (1).

Let us generate a set $\mathcal{C}_n$ of $k_n$ such characters independently by the process described (i.e.
constructing partitions of $X$ and for each partition giving an associated character). The probability
that at least one of these characters has more than $r_n$ states is, by Boole's inequality, at
most

$$n^\alpha \exp(-\mu_n/3) = n^\alpha \exp\left(-\tfrac{1}{3}\left(\tfrac{1}{2} - o(1)\right)n^\beta\right) \to 0$$

as $n \to \infty$ (recall $\beta > 0$). Thus, there exists some value $N_1$ for which, for any $n > N_1$, at least
one character in $\mathcal{C}_n$ takes more than $r_n$ states with probability at most $1/3$.
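
To see how quickly this union bound decays, one can evaluate $k_n \exp(-\mu_n/3)$ for sample parameters. This is a minimal sketch; the function name and the choice $\alpha = 0.8$, $\beta = 0.5$ are assumptions made only for illustration.

```python
import math


def too_many_states_bound(n, alpha, beta):
    """Union bound k_n * exp(-mu_n / 3) on P(some character exceeds r_n states)."""
    k_n = math.floor(n ** alpha)
    r_n = math.floor(n ** beta)
    mu_n = (2 * n - 3) * r_n / (4 * n)  # mean of Bin(2n - 3, r_n / (4n))
    return k_n * math.exp(-mu_n / 3)


for n in (10**2, 10**3, 10**4):
    print(n, too_many_states_bound(n, alpha=0.8, beta=0.5))
```

For these parameters the bound exceeds 1/3 at $n = 100$ but is well below 1/3 at $n = 10^4$, consistent with the existence of a threshold $N_1$.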

What is the probability that $\mathcal{C}_n$ captures $T$? As part of a more general analysis of the
(infinite state) random cluster model in [5], Lemma 2.2 and Theorem 2.4 of that paper show
that $\mathcal{C}_n$ captures $T$ with probability at least $1 - \epsilon$ provided that $k \geq \lceil \frac{1}{B}\log(n^2/\epsilon) \rceil$, where
$B = p_n\left(2 - \frac{1}{1 - p_n}\right)^4 \sim p_n$ (as $n \to \infty$). Now,

$$\frac{1}{B} \sim \frac{1}{p_n} = \frac{4n}{r_n} \sim \frac{4n}{n^\beta} = 4n^{1-\beta},$$

and since $\alpha + \beta > 1$ it follows that for any $\epsilon > 0$:

$$\frac{1}{B}\log(n^2/\epsilon)/n^\alpha \sim 4n^{1-\beta}\log(n^2/\epsilon)/n^\alpha \to 0 \text{ as } n \to \infty.$$

So taking $\epsilon = 1/3$, there is a value $N_2$ for which, for any $n > N_2$, we have
$\lceil \frac{1}{B}\log(n^2/\epsilon) \rceil \leq \lfloor n^\alpha \rfloor$. Thus, with $k_n = \lfloor n^\alpha \rfloor$ where $n > N_2$, $\mathcal{C}_n$ fails to capture $T$ with probability at
most $1/3$.
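
The comparison between the number of characters required and $\lfloor n^\alpha \rfloor$ can be tabulated numerically. The sketch below assumes the expression $B = p_n(2 - 1/(1 - p_n))^4$ with $p_n = r_n/(4n)$ and $\epsilon = 1/3$; the function name and sample parameters ($\alpha = 0.8$, $\beta = 0.5$) are illustrative assumptions.

```python
import math


def required_characters(n, beta, eps=1 / 3):
    """ceil(log(n^2 / eps) / B), assuming B = p_n * (2 - 1/(1 - p_n))**4."""
    r_n = math.floor(n ** beta)
    p_n = r_n / (4 * n)
    B = p_n * (2 - 1 / (1 - p_n)) ** 4
    return math.ceil(math.log(n * n / eps) / B)


alpha, beta = 0.8, 0.5
for n in (10**4, 10**6, 10**8):
    print(n, required_characters(n, beta), math.floor(n ** alpha))
```

For these parameters the required count still exceeds $\lfloor n^\alpha \rfloor$ at $n = 10^6$ and falls below it by $n = 10^8$, which suggests that the threshold $N_2$ can be large when $\alpha + \beta$ is only moderately above 1.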

Combining these two observations, if we set $N = \max\{N_1, N_2\}$ then for all $n > N$, the
probability that a set $\mathcal{C}_n$ of randomly-generated characters satisfies at least one of the
following properties:

(i) $\mathcal{C}_n$ contains a character that takes more than $r_n$ states, or

(ii) $\mathcal{C}_n$ fails to capture $T$,

is at most $1/3 + 1/3 = 2/3$, by Boole's inequality. Thus there is a strictly positive probability that
$\mathcal{C}_n$ satisfies neither of conditions (i) and (ii), and so there must exist a set of $\lfloor n^\alpha \rfloor$ characters, each
taking at most $\lfloor n^\beta \rfloor$ states, which captures $T$. This completes the proof. □

Remark: Notice from the proof that the condition $\alpha + \beta > 1$ can be replaced by $\alpha + \beta = 1$
if we allow $k_n = \lfloor n^\alpha \rfloor$ characters to be replaced by $k_n = \lfloor n^\alpha \rfloor (8 + c)\log(n)$, for any $c > 0$.

